First, install the packages we will use later. The `install.packages()` calls below are commented out; uncomment them on the first run if the packages are not yet installed.
#install.packages("reshape2")
#install.packages("Hmisc")
#install.packages("maps")
#install.packages("mapdata")
#install.packages("maptools")
#install.packages("ggsci")
#install.packages("ggradar")
#install.packages("devtools")
#install.packages("gcookbook")
#install.packages("cowplot")
#devtools::install_github("ricardo-bion/ggradar", dependencies = TRUE)
#install.packages("prophet")
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
library(dplyr)
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(gcookbook)
library(cowplot)
## Warning: package 'cowplot' was built under R version 3.6.2
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(maps)
## Warning: package 'maps' was built under R version 3.6.2
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(mapdata)
## Warning: package 'mapdata' was built under R version 3.6.2
library("maptools")
## Warning: package 'maptools' was built under R version 3.6.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.6.2
## Checking rgeos availability: FALSE
## Note: when rgeos is not available, polygon geometry computations in maptools depend on gpclib,
## which has a restricted licence. It is disabled by default;
## to enable gpclib, type gpclibPermit()
##
## Attaching package: 'maptools'
## The following object is masked from 'package:Hmisc':
##
## label
library(ggsci)
## Warning: package 'ggsci' was built under R version 3.6.2
#library(xlsx)
library("dplyr")
library(ggradar)
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
library(prophet)
## Warning: package 'prophet' was built under R version 3.6.2
## Loading required package: Rcpp
## Loading required package: rlang
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
## flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
## splice
Energy production and usage make up a major portion of any economy. In the United States, many aspects of energy policy are decentralized to the state level. For the final project, our goal is to use the energy data of four U.S. states to characterize how each state's energy profile has evolved over time; to form an energy performance index for evaluating the states' energy performance; and to forecast energy production, energy consumption, and CO2 emissions.
Our final project mainly uses the 2018 MCM problem C dataset. This dataset covers the energy data of four states along the U.S. border with Mexico, namely California (CA), Arizona (AZ), New Mexico (NM), and Texas (TX). It provides state energy production, consumption, price, and expenditure estimates for 1960-2009. The consumption data is broken down into five energy-consuming sectors: residential, commercial, industrial, transportation, and electric power. (The first four, residential through transportation, are also called end-use sectors.) Every variable is identified by an MSN code from the State Energy Data System, as illustrated in the following picture:
The first two characters identify the energy type; the third and fourth characters identify the energy activity or consuming sector; the fifth character identifies the type of data. For example, MGACV is the identifying code for motor gasoline expenditures in the transportation sector, in million dollars. In total, there are over 600 MSN codes. In addition, to enrich the analysis, we also collected CO2 emission data for the four states.
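As a quick illustration of this layout, the three fields of an MSN code can be pulled apart with `substr()`. The helper `msn_parts()` below is our own illustrative name, not part of any package:

```r
# Split an MSN code into its three fields (illustrative helper, not from a package)
msn_parts <- function(msn) {
  list(
    energy = substr(msn, 1, 2),  # characters 1-2: energy type (MG = motor gasoline)
    sector = substr(msn, 3, 4),  # characters 3-4: activity / consuming sector (AC = transportation)
    data   = substr(msn, 5, 5)   # character 5: type of data (V = value in million dollars)
  )
}
msn_parts("MGACV")
```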
First, we read the selected-variables file and the initial energy data file, and join them so that we keep only the energy data for the selected variables.
var<-read_xlsx(path = "variable.xlsx", sheet = 1)
initial_data<-read_xlsx(path = "ProblemCData.xlsx", sheet = 1)
initial_data<-left_join(var, initial_data)
## Joining, by = "MSN"
head(initial_data)
## # A tibble: 6 x 4
## MSN StateCode Year Data
## <chr> <chr> <dbl> <dbl>
## 1 BMTCB AZ 1960 4013.
## 2 BMTCB AZ 1961 3837.
## 3 BMTCB AZ 1962 3672.
## 4 BMTCB AZ 1963 4028.
## 5 BMTCB AZ 1964 4089.
## 6 BMTCB AZ 1965 3695.
Next, we use `spread()` to convert the data from long to wide format. Looking at the wide data, we find that some variables have no data at all, so we drop those variables.
wide_data<-spread(initial_data, key = MSN, value = Data)
wide_data<-wide_data[-201,]
wide_data<-wide_data[,-which(colnames(wide_data) %in% c("CCEXB","CCEXD","CCEXV","CCIMB","CCIMD","CCIMV","CCNIB","CCNIV","NUETK"))]
#wide_data<-wide_data[,-111]
head(wide_data)
## # A tibble: 6 x 145
## StateCode Year BMTCB CLACB CLACD CLACK CLACV CLCCB CLCCD CLCCV CLEIB CLEID
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AZ 1960 4013. 3.54 NA 21.6 NA 0 NA NA 0 NA
## 2 AZ 1961 3837. 0.949 NA 21.6 NA 0 NA NA 0 NA
## 3 AZ 1962 3672. 0.902 NA 21.6 NA 0 NA NA 6065. NA
## 4 AZ 1963 4028. 0.594 NA 21.5 NA 0 NA NA 8272. NA
## 5 AZ 1964 4089. 0.416 NA 21.5 NA 0 NA NA 8990. NA
## 6 AZ 1965 3695. 0.301 NA 21.4 NA 0 NA NA 6938. NA
## # ... with 133 more variables: CLEIK <dbl>, CLEIV <dbl>, CLHCK <dbl>,
## # CLICB <dbl>, CLICD <dbl>, CLICV <dbl>, CLKCB <dbl>, CLKCD <dbl>,
## # CLKCK <dbl>, CLKCV <dbl>, CLOCB <dbl>, CLOCD <dbl>, CLOCK <dbl>,
## # CLOCV <dbl>, CLOSB <dbl>, CLPRB <dbl>, CLPRK <dbl>, CLRCB <dbl>,
## # CLRCD <dbl>, CLRCV <dbl>, CLRFB <dbl>, CLTCB <dbl>, CLTCD <dbl>,
## # CLTCV <dbl>, CLTXB <dbl>, CLTXD <dbl>, CLTXV <dbl>, CLXCD <dbl>,
## # CLXCV <dbl>, COICB <dbl>, COPRK <dbl>, EMFDB <dbl>, FSICB <dbl>,
## # FSICD <dbl>, FSICV <dbl>, GETCB <dbl>, GETXB <dbl>, GOCCB <dbl>,
## # GORCB <dbl>, HYCCB <dbl>, HYEGB <dbl>, HYICB <dbl>, HYTCB <dbl>,
## # HYTXB <dbl>, NAICB <dbl>, NGACB <dbl>, NGACD <dbl>, NGACP <dbl>,
## # NGACV <dbl>, NGCCB <dbl>, NGCCD <dbl>, NGCCV <dbl>, NGEIB <dbl>,
## # NGEID <dbl>, NGEIK <dbl>, NGEIV <dbl>, NGICB <dbl>, NGICD <dbl>,
## # NGICV <dbl>, NGISB <dbl>, NGLPB <dbl>, NGMPB <dbl>, NGMPK <dbl>,
## # NGPZB <dbl>, NGRCB <dbl>, NGRCD <dbl>, NGRCV <dbl>, NGRFB <dbl>,
## # NGTCB <dbl>, NGTCD <dbl>, NGTCK <dbl>, NGTCV <dbl>, NGTXB <dbl>,
## # NGTXD <dbl>, NGTXK <dbl>, NGTXV <dbl>, NGVHB <dbl>, NNACB <dbl>,
## # NNCCB <dbl>, NNEIB <dbl>, NNICB <dbl>, NNRCB <dbl>, NNTCB <dbl>,
## # NUEGB <dbl>, NUEGD <dbl>, NUEGV <dbl>, NUETB <dbl>, NUETD <dbl>,
## # NUETV <dbl>, PAPRB <dbl>, PMTCB <dbl>, REPRB <dbl>, RETCB <dbl>,
## # ROPRB <dbl>, SGICB <dbl>, SOEGB <dbl>, SOHCB <dbl>, SOTCB <dbl>,
## # SOTXB <dbl>, TEACB <dbl>, ...
We write the data to a CSV file.
write.csv(wide_data,"wide_data.csv")
There are some missing values in the table. The usual way to fill in missing values is with the mean, the mode, or 0, but instead we try the grey prediction method (GM(1,1)), which extrapolates from the trend of the existing values. Because most of the missing values fall in 1960-1969, we reverse the time series and use the data from 2009 back to 1970 to make predictions. In addition, one variable has missing values from 1960 to 1979; we handle it separately, using the data from 2009 back to 1980.
# This code implements the GM(1,1) grey prediction method.
# input:  x is the initial data; adv is the forecast step.
# output: actual is the initial data; fit is the fitted data; degree is the fitting precision;
#         C is the posterior variance ratio; P is the small-error probability;
#         predict is the extrapolated forecast value.
# main function
# main function
GM <- function(x, adv = 0) {
  x0 <- x
  k <- length(x0)
  # AGO: accumulated generating operation
  x1 <- cumsum(x0)
  # construct matrices B & Y
  B <- cbind(-0.5 * (x1[-k] + x1[-1]), rep(1, times = k - 1))
  Y <- x0[-1]
  # least-squares estimate: alpha = (B'B)^-1 B'Y
  BtB <- t(B) %*% B
  BtB.inv <- solve(BtB)
  BtY <- t(B) %*% Y
  alpha <- BtB.inv %*% BtY
  # prediction model
  predict <- function(k) {
    y <- (x0[1] - alpha[2] / alpha[1]) * exp(-alpha[1] * k) + alpha[2] / alpha[1]
    return(y)
  }
  pre <- sapply(X = 0:(k - 1), FUN = predict)
  prediff <- c(pre[1], diff(pre))
  # calculate residuals
  error <- round(abs(prediff - x0), digits = 6)
  emax <- max(error)
  emin <- min(error)
  # evaluate model: grey incidence degree
  incidence <- function(x) {
    return((emin + 0.5 * emax) / (x + 0.5 * emax))
  }
  degree <- mean(sapply(error, incidence))
  # posterior variance test: C is the variance ratio, P the small-error probability
  s1 <- sqrt(sum((x0 - mean(x0))^2) / (k - 1))
  s2 <- sqrt(sum((error - mean(error))^2) / (k - 1))
  C <- s2 / s1
  e <- abs(error - mean(error))
  p <- sum(e < 0.6745 * s1) / length(e)
  result <- list(actual = x0,
                 fit = prediff,
                 degree = degree,
                 C = C,
                 P = p)
  # extrapolate to step k + adv
  if (adv > 0) {
    pre.adv <- predict(k + adv - 1) - predict(k + adv - 2)
    result$predict <- pre.adv
  }
  class(result) <- 'GM1.1'
  return(result)
}
# print & plot methods for GM1.1 objects
print.GM1.1 <- function(mod) {
  cat('the result of GM(1,1)\n')
  cat('Actual Data:', '\n', mod$actual, '\n')
  cat('Fit Data:', '\n', round(mod$fit, 2), '\n')
  cat('Degree:', round(mod$degree, 3), '\n')
  cat('C:', round(mod$C, 3), '\n')
  cat('P:', round(mod$P, 3), '\n')
  if (!is.null(mod$predict)) {
    cat('Predict Data:', round(mod$predict, 2), '\n')
  }
}
plot.GM1.1 <- function(mod, adv = 5) {
  prex <- numeric(adv)
  x <- mod$actual
  for (k in 1:adv) {
    prex[k] <- GM(x, k)$predict
  }
  value <- c(x, prex)
  res <- data.frame(index = 1:length(value),
                    value = value,
                    type = factor(c(rep(1, length(x)), rep(2, length(prex)))))
  p <- ggplot(res, aes(x = index, y = value))
  p + geom_point(aes(color = type), size = 3) +
    geom_path(linetype = 2) +
    theme_bw()
}
# General structure of tryCatch (a template, not runnable as-is):
# result <- tryCatch(
#   { expr },
#   warning = function(w) { warning-handler-code },
#   error   = function(e) { error-handler-code }
# )
tryCatch({
  a <- "c"
  b <- "c"
  b == a
},
error = function(e) { cat("hahaha", conditionMessage(e), "\n\n") },
finally = { print("ccc") })
## [1] "ccc"
## [1] TRUE
tryCatch({
  a <- "c"
  b <- "c"
  b == a
},
warning = function(w) { cat("WARNING :", conditionMessage(w), "\n") },
error = function(e) { cat("ERROR :", conditionMessage(e), "\n") }
)
## [1] TRUE
# predict: fill the missing 1960-1969 values, column by column and state by state
for (i in 3:145) {
  if (is.na(wide_data[1, i])) {
    for (j in 1:4) {
      stop <- tryCatch({
        # reversed series: rows 50*j down to 50*(j-1)+11 correspond to 2009..1970
        x0 <- as.matrix(wide_data[(50*j):(50*(j-1)+11), i])
        for (k in 1:10) {
          res <- GM(x0, k)
          wide_data[(50*(j-1)+11-k), i] <- res$predict
        }
      },
      error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
    }
  }
}
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : system is computationally singular: reciprocal condition number = 2.01313e-16
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : system is computationally singular: reciprocal condition number = 8.23857e-17
## ERROR : system is computationally singular: reciprocal condition number = 3.26825e-17
## ERROR : system is computationally singular: reciprocal condition number = 0
## ERROR : system is computationally singular: reciprocal condition number = 0
## ERROR : system is computationally singular: reciprocal condition number = 0
## ERROR : system is computationally singular: reciprocal condition number = 0
## ERROR : system is computationally singular: reciprocal condition number = 1.3416e-17
## ERROR : system is computationally singular: reciprocal condition number = 7.05496e-18
#print(res)
#plot(res,9)
# TETGR alone: missing through 1979, so fit on the reversed 2009-1980 series
for (j in 1:4) {
  x0 <- as.matrix(wide_data[(50*j):(50*(j-1)+18), which(colnames(wide_data) == "TETGR")])
  for (k in 1:17) {
    res <- GM(x0, k)
    wide_data[(50*(j-1)+18-k), 131] <- res$predict
  }
}
We save the filled-in data.
write.csv(wide_data,"wide_data_fill.csv")
Some missing values cannot be filled in by this method (for example, when the matrix in GM(1,1) is singular), so we fill them with 0 or the state-level mean of the corresponding variable. We name the completed file “wide_data_fill_final.csv”.
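A minimal sketch of that fallback fill, assuming a plain data.frame with a `StateCode` column (the function name `fill_state_mean` is ours; this is not the exact code used to produce “wide_data_fill_final.csv”):

```r
# Fallback fill: replace remaining NAs with the state-level mean of the variable,
# or 0 when the variable is entirely missing for that state. Illustrative only.
fill_state_mean <- function(df, state_col = "StateCode") {
  num_cols <- which(sapply(df, is.numeric))
  for (s in unique(df[[state_col]])) {
    rows <- which(df[[state_col]] == s)
    for (j in num_cols) {
      col <- df[rows, j]
      m <- mean(col, na.rm = TRUE)
      if (is.nan(m)) m <- 0  # no observed values at all for this state
      col[is.na(col)] <- m
      df[rows, j] <- col
    }
  }
  df
}
```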
Now we fill in the missing CO2 emission data. We begin by preprocessing the CO2 emission data of Arizona. First, we read the data.
co2_ar<-read_xlsx(path = "ar.xlsx", sheet = 1)
Next, we gather the year columns into long format and then spread by variable name, so that each row is a year and each column an emission category.
co2_ar =co2_ar %>%
gather('1980','1981','1982','1983','1984','1985','1986','1987','1988','1989','1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015',key = "year",value = "value")
co2_ar_wide<-spread(co2_ar, key = Name, value = value)
Create a new table, “co2_ar_fill”, and copy the original data into rows 21 to 56. The first twenty rows of the new table are used for the predicted values for 1960-1979.
co2_ar_fill <- as.data.frame(array(, dim = c(56, 26)))
names(co2_ar_fill) <- names(co2_ar_wide)
co2_ar_fill[21:56, ] <- co2_ar_wide[1:36, ]
co2_ar_fill[1:20, 1] <- 1960:1979
co2_ar_fill_1 <- co2_ar_fill
Fill in the missing values using the grey prediction method. The table “co2_ar_fill” is filled with the raw predicted values; in the second table, “co2_ar_fill_1”, any predicted value less than 0 is replaced by 0.
for (i in 2:26) {
  stop <- tryCatch({
    x0 <- as.matrix(co2_ar_fill[56:21, i])  # reversed series: 2015..1980
    for (k in 1:20) {
      res <- GM(x0, k)
      co2_ar_fill[(21-k), i] <- res$predict
      if (res$predict < 0) {
        co2_ar_fill_1[(21-k), i] <- 0
      } else {
        co2_ar_fill_1[(21-k), i] <- res$predict
      }
    }
  },
  error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
We save the filled-in data.
write.csv(co2_ar_fill,"co2_ar_fill.csv")
write.csv(co2_ar_fill_1,"co2_ar_fill_1.csv")
Similarly, we apply the same processing steps to the CO2 emission data of California.
co2_ca<-read_xlsx(path = "ca.xlsx", sheet = 1)
co2_ca =co2_ca %>%
gather('1980','1981','1982','1983','1984','1985','1986','1987','1988','1989','1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015',key = "year",value = "value")
co2_ca_wide<-spread(co2_ca, key = Name, value = value)
co2_ca_fill <- as.data.frame(array(, dim = c(56, 26)))
names(co2_ca_fill) <- names(co2_ca_wide)
co2_ca_fill[21:56, ] <- co2_ca_wide[1:36, ]
co2_ca_fill[1:20, 1] <- 1960:1979
co2_ca_fill_1 <- co2_ca_fill
for (i in 2:26) {
  stop <- tryCatch({
    x0 <- as.matrix(co2_ca_fill[56:21, i])
    for (k in 1:20) {
      res <- GM(x0, k)
      co2_ca_fill[(21-k), i] <- res$predict
      if (res$predict < 0) {
        co2_ca_fill_1[(21-k), i] <- 0
      } else {
        co2_ca_fill_1[(21-k), i] <- res$predict
      }
    }
  },
  error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
We save the filled-in data.
write.csv(co2_ca_fill,"co2_ca_fill.csv")
write.csv(co2_ca_fill_1,"co2_ca_fill_1.csv")
Similarly, we apply the same processing steps to the CO2 emission data of New Mexico.
co2_me<-read_xlsx(path = "me.xlsx", sheet = 1)
co2_me =co2_me %>%
gather('1980','1981','1982','1983','1984','1985','1986','1987','1988','1989','1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015',key = "year",value = "value")
co2_me_wide<-spread(co2_me, key = Name, value = value)
co2_me_fill <- as.data.frame(array(, dim = c(56, 26)))
names(co2_me_fill) <- names(co2_me_wide)
co2_me_fill[21:56, ] <- co2_me_wide[1:36, ]
co2_me_fill[1:20, 1] <- 1960:1979
co2_me_fill_1 <- co2_me_fill
for (i in 2:26) {
  stop <- tryCatch({
    x0 <- as.matrix(co2_me_fill[56:21, i])
    for (k in 1:20) {
      res <- GM(x0, k)
      co2_me_fill[(21-k), i] <- res$predict
      if (res$predict < 0) {
        co2_me_fill_1[(21-k), i] <- 0
      } else {
        co2_me_fill_1[(21-k), i] <- res$predict
      }
    }
  },
  error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
write.csv(co2_me_fill,"co2_me_fill.csv")
write.csv(co2_me_fill_1,"co2_me_fill_1.csv")
Similarly, we apply the same processing steps to the CO2 emission data of Texas.
co2_te<-read_xlsx(path = "te.xlsx", sheet = 1)
co2_te =co2_te %>%
gather('1980','1981','1982','1983','1984','1985','1986','1987','1988','1989','1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015',key = "year",value = "value")
co2_te_wide<-spread(co2_te, key = Name, value = value)
co2_te_fill <- as.data.frame(array(, dim = c(56, 26)))
names(co2_te_fill) <- names(co2_te_wide)
co2_te_fill[21:56, ] <- co2_te_wide[1:36, ]
co2_te_fill[1:20, 1] <- 1960:1979
co2_te_fill_1 <- co2_te_fill
for (i in 2:26) {
  stop <- tryCatch({
    x0 <- as.matrix(co2_te_fill[56:21, i])
    for (k in 1:20) {
      res <- GM(x0, k)
      co2_te_fill[(21-k), i] <- res$predict  # fill the raw-prediction table, as for the other states
      if (res$predict < 0) {
        co2_te_fill_1[(21-k), i] <- 0
      } else {
        co2_te_fill_1[(21-k), i] <- res$predict
      }
    }
  },
  error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
## ERROR : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
write.csv(co2_te_fill,"co2_te_fill.csv")
write.csv(co2_te_fill_1,"co2_te_fill_1.csv")
Some missing values cannot be filled in by this method, so we fill them with 0 or the state-level mean of the corresponding variable. In addition, there are several summary variables, so we replace the predicted value of each summary variable with the sum of its sub-variables. We name the completed files “co2_ar_fill_2.csv”, “co2_ca_fill_2.csv”, “co2_me_fill_2.csv”, and “co2_te_fill_2.csv”. Finally, we combine the data of the four states and save a new file, “co2_all_fill_2.csv”.
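The summary-variable adjustment can be sketched like this; `recompute_total` and the column names are illustrative, not the actual names used in the `*_fill_2.csv` files:

```r
# Overwrite a (possibly predicted) summary column with the row sum of its
# sub-variable columns, so totals stay consistent after filling.
# Column names here are hypothetical.
recompute_total <- function(df, total_col, sub_cols) {
  df[[total_col]] <- rowSums(df[, sub_cols, drop = FALSE], na.rm = TRUE)
  df
}
```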
co2_AZ_fill_2 <- read.csv("co2_ar_fill_2.csv",stringsAsFactors = F)
co2_AZ_fill_2<-co2_AZ_fill_2[,-1]
co2_CA_fill_2 <- read.csv("co2_ca_fill_2.csv",stringsAsFactors = F)
co2_CA_fill_2<-co2_CA_fill_2[,-1]
co2_NM_fill_2 <- read.csv("co2_me_fill_2.csv",stringsAsFactors = F)
co2_NM_fill_2<-co2_NM_fill_2[,-1]
co2_TX_fill_2 <- read.csv("co2_te_fill_2.csv",stringsAsFactors = F)
co2_TX_fill_2<-co2_TX_fill_2[,-1]
We combine the data of the four states.
co2_AZ<-cbind(StateCode="AZ",co2_AZ_fill_2)
co2_CA<-cbind(StateCode="CA",co2_CA_fill_2)
co2_NM<-cbind(StateCode="NM",co2_NM_fill_2)
co2_TX<-cbind(StateCode="TX",co2_TX_fill_2)
co2data<-rbind(co2_AZ,co2_CA,co2_NM,co2_TX)
names(co2data)[names(co2data) == 'StateCode'] <- 'state'
write.csv(co2data,"co2_all_fill_2.csv")
After the descriptive analysis of the energy data, we finally need a method to evaluate the performance of each state. Because a large number of variables correlate with performance, we did not run descriptive analysis on all of them. Instead, we try to establish a system, which we call the Energy Performance Index, to carry out the evaluation. We also draw on the Global Energy Architecture Performance Index Report 2017, cited as
World Economic Forum. (2017). Global Energy Architecture Performance Index Report 2017. WEF and Accenture, 1-32.
We utilize some of the variables introduced before, combined with other variables that we came up with or borrowed from the report above. We divide the variables into three main indexes, three perspectives from which we can gain insight into energy performance:
For each index, we take four variables as indicators (sub-indexes), whose detailed formulas we give later. The figure below shows the architecture of our Energy Performance Index system.
Load data which has been preprocessed.
energy <- read.csv('wide_data_fill_final_Tang.csv', header = T)
columns <- names(energy)
It is the ratio of renewable energy production to total production, and shows a state's capacity for, and awareness of, renewable energy use. The larger the share of renewable energy produced, the less potential pollution and waste, and the better for the environment.
\[ Renewable\ Energy\ Production\ Ratio = \frac{renewable\ energy\ production(billion~btu)}{total\ energy\ production(billion~btu)} \]
## Extract production data
production <- energy %>%
select(Year, StateCode, BMTCB, GETCB, HYTCB, WYTCB, SOTCB, NUETB, CLPRB, PAPRB, NGMPB, TEPRB) %>%
mutate(Total_Production = BMTCB + GETCB + HYTCB + WYTCB + SOTCB + NUETB + CLPRB + PAPRB + NGMPB) %>%
mutate(Renew_Ratio = (BMTCB + GETCB + HYTCB + WYTCB + SOTCB)/Total_Production)
ggplot(data = production, mapping = aes(x = Year, y = Renew_Ratio))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Renewable Energy Production Ratio")
It is the ratio of renewable energy consumption to total consumption, and shows the popularity of renewable energy in a state. As with the Renewable Production Ratio, the larger the share of renewable energy consumed, the better for the environment. \[ Renewable\ Energy\ Consumption\ Ratio = \frac{renewable\ energy\ consumption(billion~btu)}{total\ energy\ consumption(billion~btu)} \]
## Extract consumption data
consumption <- energy %>%
select(Year, StateCode, BMTCB, CLTCB, GETCB, HYTCB, NNTCB, PMTCB, WYTCB, SOTCB, NUETB, TETCB) %>%
mutate(Total_Consumption = BMTCB + CLTCB + GETCB + HYTCB + NNTCB + PMTCB + WYTCB + SOTCB + NUETB) %>%
mutate(Renew_Ratio = (BMTCB + GETCB + HYTCB + WYTCB + SOTCB)/Total_Consumption)
ggplot(data = consumption, mapping = aes(x = Year, y = Renew_Ratio))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Renewable Energy Consumption Ratio")
Because the energy department only provides conversion factors for three main energy sources (natural gas, petroleum, and coal), we can only use these three factors. This indicator shows the efficiency of converting primary energy to electricity, that is, whether the energy has been maximally utilized: if we can make maximum use of every unit of energy, we waste less and help the environment more. \[ Efficiency = \frac{1}{n} \sum_{i=1}^n {factor~of~converting~by~energy~type(million~btu~per~short~ton)} \]
## MinMaxScaler
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01_omit0 <- function(x) {
y <- x[x!=0]
rng <- range(y, na.rm = TRUE)
for (i in seq(1, length(x), by=1)){
if (x[i] != 0) x[i] = (x[i] - rng[1]) / (rng[2] - rng[1])
}
return (x)
}
energy <- energy %>%
mutate(
CLEIK_scaled = rescale01(CLEIK),
NGEIK_scaled = rescale01(NGEIK),
PAEIK_scaled = rescale01(PAEIK)
)
## Calculate efficiency
efficiency <- energy %>%
select(Year, StateCode, CLEIK_scaled, NGEIK_scaled, PAEIK_scaled) %>%
mutate(Efficiency = (CLEIK_scaled+NGEIK_scaled+PAEIK_scaled)/3)
ggplot(data = efficiency, mapping = aes(x = Year, y = Efficiency))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Efficiency (Factor of Converting, scaled to 0-1)")
It shows the level of carbon dioxide emissions. Carbon dioxide is the main contributor to the greenhouse effect, which has a significant influence on climate change, and one main reason to advocate clean energy is that it can markedly reduce CO2 emissions. We can therefore assert that the less CO2 is emitted, the healthier the energy performance. \[ CO_2~Emission~per~unit~Energy~Use=1-\frac{CO_2~emission(million~metric~tons)}{total~energy~consumption(billion~btu)}(MaxMin~scaled) \]
## Bind data from different files
## (co2az, co2ca, co2nm, co2tx are assumed to be the per-state emission tables
## loaded from the corresponding *_fill_2 files)
detach("package:plyr")
co2_emission <- rbind(co2az, co2ca, co2nm, co2tx) %>%
select(year, StateCode, Total) %>%
rename(Year=year, CO2_Emission=Total) %>%
filter(Year<=2009)
## Extract consumption data
consumption2 <- energy %>%
select(Year, StateCode, BMTCB, CLTCB, GETCB, HYTCB, NNTCB, PMTCB, WYTCB, SOTCB, NUETB, TETCB) %>%
mutate(Total_Consumption = BMTCB + CLTCB + GETCB + HYTCB + NNTCB + PMTCB + WYTCB + SOTCB + NUETB) %>%
select(Year, StateCode, Total_Consumption, TETCB)
## Divide by the log of consumption to damp scale differences across states
co2_emission <- co2_emission %>%
left_join(consumption2, by = c('Year', 'StateCode')) %>%
mutate(Unit_CO2_Emission = CO2_Emission/log(Total_Consumption))
## Warning: Column `StateCode` joining character vector and factor, coercing into
## character vector
## MinMaxScaler
co2_emission <- co2_emission %>%
mutate(Unit_CO2_Emission_scaled = 1 - rescale01(Unit_CO2_Emission))
ggplot(data = co2_emission, mapping = aes(x = Year, y = Unit_CO2_Emission_scaled))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Unit CO2 Emission (scaled to 1-0)")
It is defined as GDP per unit energy use. It shows the contribution of per unit energy to the economy, which can also be used to compare the productivity of energy by state. \[ Energy ~Intensity=\frac{GDP(million~dollars)}{total~energy~consumption(billion~btu)} \]
gdp <- read.csv('GDP_chained.csv')
gdp <- gdp %>%
gather('AZ', 'CA', 'NM', 'TX', key = 'StateCode', value = GDP_Chained) %>%
filter(Year<=2009)
intensity <- gdp %>%
left_join(consumption, by = c('Year', 'StateCode')) %>%
mutate(Energy_Intensity = GDP_Chained/Total_Consumption) %>%
select(Year, StateCode, GDP_Chained, Total_Consumption, Energy_Intensity)
## Warning: Column `StateCode` joining character vector and factor, coercing into
## character vector
## MinMaxScaler
intensity <- intensity %>%
mutate(Energy_Intensity_scaled = rescale01(Energy_Intensity))
ggplot(data = intensity, mapping = aes(x = Year, y = Energy_Intensity_scaled))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Energy Intensity (scaled to 0-1)")
It is the ratio of total energy export value to GDP, and captures part of the contribution that energy production makes to GDP. \[ Export ~Contribution=\frac{total~energy~export(million~dollars)}{GDP(million~dollars)} \]
exp_imp <- energy %>%
select(Year, StateCode, ELEXV, ELIMV) %>%
left_join(gdp, by = c('Year', 'StateCode')) %>%
mutate(Export_Contri = ELEXV/GDP_Chained,
Import_Contri = ELIMV/GDP_Chained) %>%
mutate(Export_Contri_scaled = rescale01(Export_Contri),
Import_Contri_scaled = rescale01(Import_Contri))
## Warning: Column `StateCode` joining factor and character vector, coercing into
## character vector
ggplot(data = exp_imp, mapping = aes(x = Year, y = Export_Contri_scaled))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Export Contribution (scaled to 0-1)")
## Warning: Removed 12 rows containing missing values (geom_path).
Similar to Export Contribution. \[ Import ~Contribution=\frac{total~energy~import(million~dollars)}{GDP(million~dollars)} \]
ggplot(data = exp_imp, mapping = aes(x = Year, y = Import_Contri_scaled))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Import Contribution (scaled to 0-1)")
## Warning: Removed 12 rows containing missing values (geom_path).
It is represented by the electricity price for industry. Empirically, business owners are willing to produce more when the electricity price is lower, which in turn contributes to the growth of GDP. \[ Affordability=1-electricity~ price~(MaxMin~scaled) \]
afford <- energy %>%
select(Year, StateCode, ESTCD) %>%
mutate(ESTCD_scaled = 1 - rescale01_omit0(ESTCD))
ggplot(data = afford, mapping = aes(x = Year, y = ESTCD_scaled))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Price Affordability (scaled to 0-1)")
It measures a state's ability to survive an import blockade; we use electricity consumption for the analysis. If imported electricity occupies a larger proportion of total consumption, the state is more dependent on other states or countries, and under this measure dependence is not a good trait: the less dependent a state is on other regions, the higher its self-sufficiency. \[ Self ~Sufficiency=1-\frac{electricity ~import(billion~btu)}{electricity ~total~ consumption(billion~btu)} \]
self_suff <- energy %>%
select(Year, StateCode, ELIMB, ESTCB) %>%
mutate(Self_Suff = 1 - ELIMB/ESTCB)
ggplot(data = self_suff, mapping = aes(x = Year, y = Self_Suff))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Self Sufficiency")
It is an indicator that quantifies the diversity of a state's energy structure: the higher the diversity index, the more energy types there are and the more evenly they are distributed. If one energy source is hit by an import blockade or another unexpected event, a state with a diverse energy supply will be less affected and function much better than a state that depends mainly on just one or two energy types. Eventually fossil fuels will be used up and diversity must decline, but since energy exhaustion is of little concern in the short term, we still use this measure. To calculate the diversity, we borrow from the Herfindahl index. \[ Energy ~Diversity ~Index=1- \sum_{i=1}^n P_i^2 \]
\(\qquad where~P_i=the~ratio~of~each~ energy~ type~ consumption~ to~ total~ energy ~consumption.\)
## BMTCB + CLTCB + GETCB + HYTCB + NNTCB + PMTCB + WYTCB + SOTCB + NUETB = Total
edi <- consumption %>%
mutate(EDI = 1 - ((BMTCB/Total_Consumption)^2+
(CLTCB/Total_Consumption)^2+
(GETCB/Total_Consumption)^2+
(HYTCB/Total_Consumption)^2+
(NNTCB/Total_Consumption)^2+
(PMTCB/Total_Consumption)^2+
(WYTCB/Total_Consumption)^2+
(SOTCB/Total_Consumption)^2+
(NUETB/Total_Consumption)^2))
ggplot(data = edi, mapping = aes(x = Year, y = EDI))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Energy Diversity Index")
It measures the stability of the energy price. Price stability influences the expectations of business owners and individuals. If the energy price varies severely, energy-intensive corporations suffer a heavy cost burden, which may deter investors from putting money into those corporations. Moreover, individuals may perceive it as a sign of governmental or economic instability, which can cause complaints and social problems. We use the percentage change of the annual average primary energy price (dollars per million btu) to quantify this indicator. \[ Price ~Stability=1-\frac{ \left| current ~year's ~price-last~ year's ~price \right|}{last ~year's~ price} \]
price <- energy[c('Year','StateCode','PETCD')]
p_az <- filter(price, StateCode=='AZ')
p_ca <- filter(price, StateCode=='CA')
p_nm <- filter(price, StateCode=='NM')
p_tx <- filter(price, StateCode=='TX')
## calculate price stability
cal_ps <- function(p) {
  res <- c()
  # the first 11 entries have no usable prior-year comparison, so set them to 1
  for (i in seq(1, 11, by = 1)) {
    res[i] <- 1
  }
  for (i in seq(12, length(p), by = 1)) {
    res[i] <- 1 - abs(p[i] - p[i - 1]) / p[i - 1]
  }
  return(res)
}
ps_az <- p_az %>%
mutate(Price_Stab = cal_ps(PETCD))
ps_ca <- p_ca %>%
mutate(Price_Stab = cal_ps(PETCD))
ps_nm <- p_nm %>%
mutate(Price_Stab = cal_ps(PETCD))
ps_tx <- p_tx %>%
mutate(Price_Stab = cal_ps(PETCD))
ps <- rbind(ps_az, ps_ca, ps_nm, ps_tx)
ggplot(data = ps, mapping = aes(x = Year, y = Price_Stab))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Price Stability")
This indicator reflects a state’s per-person ability to generate energy: the more energy one person can produce, the less worry there is about energy accessibility. \[ Energy ~Production ~per~ capita=\frac{total ~primary ~energy~ production(billion~btu)}{population ~size(thousand)} \]
eppc <- energy %>%
left_join(production, by = c('Year', 'StateCode')) %>%
select('Year', 'StateCode', 'TPOPP', 'Total_Production') %>%
mutate(EPPC = Total_Production/TPOPP) %>%
mutate(EPPC_scaled = rescale01(EPPC))
ggplot(data = eppc, mapping = aes(x = Year, y = EPPC_scaled))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Energy Production per capita (scaled to 0-1)")
The analytic hierarchy process (AHP) is a structured technique for organizing and analyzing complex decisions, based on mathematics and psychology. It generates weights by combining subjective judgment with an objective consistency check. Here we use AHP to determine the weight of each indicator relative to its parent index.
## input: judgeMatrix (pairwise comparison matrix); round (decimal places of the result)
## output: weights
weight <- function (judgeMatrix, round=3) {
n = ncol(judgeMatrix)
cumProd <- apply(judgeMatrix, 1, prod) ## product of each row
weight <- cumProd^(1/n) ## n-th root: geometric-mean approximation of the principal eigenvector
weight <- weight/sum(weight) ## normalize to weights
round(weight, round)
}
## input: judgeMatrix
## output: CI, CR
CRtest <- function (judgeMatrix, round=3){
RI <- c(0, 0, 0.58, 0.9, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49, 1.51)
Wi <- weight(judgeMatrix) ## compute the weights
n <- length(Wi)
if(n > 11){
cat("The judgment matrix is too large; please use at most 11 indicators \n")
return(invisible(NULL)) ## RI is only tabulated up to n = 11
}
if (n > 2) {
W <- matrix(Wi, ncol = 1)
judgeW <- judgeMatrix %*% W
JudgeW <- as.vector(judgeW)
la_max <- sum(JudgeW/Wi)/n
CI = (la_max - n)/(n - 1)
CR = CI/RI[n]
cat("CI=", round(CI, round), "\n")
cat("CR=", round(CR, round), "\n")
if (CR <= 0.1) {
cat(" Passed CR test! \n")
cat("Weights: ", round(Wi, round), "\n")
}
else {
cat("Please adjust the judge matrix, make CR < 0.1 \n")
Wi = NULL
}
}
else if (n <= 2) {
return(Wi)
}
consequence <- c(round(CI, round), round(CR, round))
names(consequence) <- c("CI", "CR")
consequence
}
## the order of 4 indicators is RPR, RCR, EFF, CO2
## generate judgeMatrix
b <- c(1,3,1/3,1/2,
1/3,1,1/5,2,
3,5,1,4,
1/2,1/2,1/4,1)
judgeMatrix <- matrix(b, ncol=4)
## calculate the Consistency Index and Consistency Ratio
CRtest(judgeMatrix)
## CI= 0.013
## CR= 0.014
## Passed CR test!
## Weights: 0.197 0.388 0.084 0.331
## CI CR
## 0.013 0.014
## the order of 4 indicators is EI, ExpContri, ImpContri, Aff
## generate judgeMatrix
b <- c(1,1/2,1/3,1/5,
2,1,1/2,1/3,
3,2,1,2/3,
5,3,3/2,1)
judgeMatrix <- matrix(b, ncol=4)
## calculate the Consistency Index and Consistency Ratio
CRtest(judgeMatrix)
## CI= 0.003
## CR= 0.003
## Passed CR test!
## Weights: 0.485 0.273 0.147 0.095
## CI CR
## 0.003 0.003
## the order of 4 indicators is SelfSuff, EDI, PS, EPPC
## generate judgeMatrix
b <- c(1,5,2,4,
1/5,1,1/3,1/2,
1/2,3,1,2,
1/4,2,1/2,1)
judgeMatrix <- matrix(b, ncol=4)
## calculate the Consistency Index and Consistency Ratio
CRtest(judgeMatrix)
## CI= 0.007
## CR= 0.008
## Passed CR test!
## Weights: 0.081 0.476 0.155 0.288
## CI CR
## 0.007 0.008
## set weights
es_w <- matrix(c(0.197, 0.388, 0.084, 0.331), ncol=1)
eg_w <- matrix(c(0.485, 0.273, 0.147, 0.095), ncol=1)
ea_w <- matrix(c(0.081, 0.476, 0.155, 0.288), ncol=1)
## combine indicators
## ES
es <- production %>%
select('Year', 'StateCode', 'Renew_Ratio') %>%
transmute(Year = Year,
StateCode = StateCode,
renew_ratio_production = Renew_Ratio) %>%
left_join(consumption, by = c('Year', 'StateCode')) %>%
mutate(renew_ratio_consumption = Renew_Ratio) %>%
left_join(efficiency, by = c('Year', 'StateCode')) %>%
mutate(efficiency_factor_converting = Efficiency) %>%
left_join(co2_emission, by = c('Year', 'StateCode')) %>%
mutate(unit_co2_emission = Unit_CO2_Emission_scaled) %>%
select('Year', 'StateCode', 'renew_ratio_production', 'renew_ratio_consumption',
'efficiency_factor_converting', 'unit_co2_emission') %>%
mutate(es = matrix(c(renew_ratio_production,
renew_ratio_consumption,
efficiency_factor_converting,
unit_co2_emission), ncol=4)%*%es_w)
## Warning: Column `StateCode` joining factor and character vector, coercing into
## character vector
## EG
eg <- intensity %>%
select('Year', 'StateCode', 'Energy_Intensity_scaled') %>%
mutate(energy_intensity = Energy_Intensity_scaled) %>%
left_join(exp_imp, by = c('Year', 'StateCode')) %>%
mutate(export_contribution = Export_Contri_scaled,
import_contribution = Import_Contri_scaled) %>%
left_join(afford, by = c('Year', 'StateCode')) %>%
mutate(affordability = ESTCD_scaled) %>%
select('Year', 'StateCode', 'energy_intensity', 'export_contribution',
'import_contribution', 'affordability') %>%
mutate(eg = matrix(c(energy_intensity,
export_contribution,
import_contribution,
affordability), ncol=4)%*%eg_w)
## Warning: Column `StateCode` joining character vector and factor, coercing into
## character vector
## EA
ea <- self_suff %>%
select('Year', 'StateCode', 'Self_Suff') %>%
mutate(self_sufficiency = Self_Suff) %>%
left_join(edi, by = c('Year', 'StateCode')) %>%
mutate(energy_diversity_index = EDI) %>%
left_join(ps, by = c('Year', 'StateCode')) %>%
mutate(price_stability = Price_Stab) %>%
left_join(eppc, by = c('Year', 'StateCode')) %>%
mutate(energy_production_per_capita = EPPC_scaled) %>%
select('Year', 'StateCode', 'self_sufficiency', 'energy_diversity_index',
'price_stability', 'energy_production_per_capita') %>%
mutate(ea = matrix(c(self_sufficiency,
energy_diversity_index,
price_stability,
energy_production_per_capita), ncol=4)%*%ea_w)
ggplot(data = es, mapping = aes(x = Year, y = es))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Environmental Sustainability")
ggplot(data = eg, mapping = aes(x = Year, y = eg))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Economic Growth and Development")
ggplot(data = ea, mapping = aes(x = Year, y = ea))+
geom_line(mapping = aes(color = StateCode)) +
labs(y = "Energy Access and Security")
As for Environmental Sustainability, California and Texas keep improving on this index, while Arizona fell by almost 0.2 over the 50 years. Overall, none of the states earns a particularly high Environmental Sustainability score. Turning to Economic Growth and Development, all states keep rising on this index, which indicates that energy plays an increasingly important role in economic growth; California grows the fastest. For Energy Access and Security, all states fluctuate around their own baselines, with New Mexico consistently the best and California consistently the worst.
index_2007 <- es %>%
left_join(eg, by = c('Year', 'StateCode')) %>%
left_join(ea, by = c('Year', 'StateCode')) %>%
filter(Year == 2007) %>%
select('StateCode', 'es', 'eg', 'ea')
## Warning: Column `StateCode` joining character vector and factor, coercing into
## character vector
ggradar(index_2007)
index_2007 <- index_2007 %>%
mutate(average = (es+eg+ea)/3)
index_2007
## StateCode es eg ea average
## 1 AZ 0.37795874 0.4497363 0.6019171 0.4765374
## 2 CA 0.26927682 0.6111100 0.5294943 0.4699603
## 3 NM 0.35883011 0.3260915 0.7180236 0.4676484
## 4 TX 0.05995843 0.2240128 0.5852338 0.2897350
index_2007.T <- t(index_2007[,2:ncol(index_2007)])
colnames(index_2007.T) <- index_2007[,1]
index_2007.T <- data.frame(index_2007.T)
index_2007.T <- index_2007.T %>%
mutate(Indexes = row.names(index_2007.T)) %>%
select('Indexes', 'AZ', 'CA', 'NM', 'TX')
ggradar(index_2007.T)
For a specific year, we can draw a radar chart to compare the four states. In 2007, Arizona, California and New Mexico each lead on one index, so it is difficult to single out a best state. If we weight the three indexes evenly, Arizona reaches the highest average, 0.4765.
However, different people and different governments will view these aspects differently. The weights could also be determined by AHP or simply set by hand. We therefore leave the final step open to anyone interested in this index system, and to users with different situations and concerns.
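For instance, with user-chosen weights (arbitrary numbers for illustration, not derived from the data) applied to the rounded 2007 values from the table above:

```r
# Hypothetical preference weights for (es, eg, ea); any non-negative
# weights summing to 1 would work equally well here.
user_w <- c(0.5, 0.3, 0.2)

# Rounded 2007 index values from the table above
scores <- data.frame(
  StateCode = c("AZ", "CA", "NM", "TX"),
  es = c(0.378, 0.269, 0.359, 0.060),
  eg = c(0.450, 0.611, 0.326, 0.224),
  ea = c(0.602, 0.529, 0.718, 0.585)
)

# Weighted final score, then rank the states
scores$final <- as.numeric(as.matrix(scores[, c("es", "eg", "ea")]) %*% user_w)
scores[order(-scores$final), c("StateCode", "final")]
```

Different weightings reorder the states: a weighting dominated by Energy Access and Security, for example, would favor New Mexico.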
We have the production and consumption data for nine energy sources from 1970 to 2009. So far we have visualized the data and analyzed the energy situation in each state; now we want to make predictions from the historical data. Since nine sources are too many, we choose the three major ones, namely coal, oil and natural gas, and predict their values from 2010 to 2017.
Firstly, we read the initial data file. The second and third files are consumption and production data from 1960 to 2017, downloaded from the official website.
use_btu<-read.csv("use_all_btu.csv",stringsAsFactors = F)
pro_btu<-read_xlsx(path = "Prod_btu.xlsx", sheet = 1)
Now we select the consumption data and production data of the three major energy sources from the initial data.
initial_data<-select(wide_data,StateCode,Year,Coal_use=CLTCB,Nat_gas_use=NNTCB,Petr_use=PMTCB,Coal_pro=CLPRB,Nat_gas_pro=NGMPB,Petr_pro=PAPRB)
Next, we select the consumption data of the three major energy sources for the four states (AZ, CA, NM, TX) from the official consumption data.
use_btu<-use_btu[,-1]
use_btu<-filter(use_btu,MSN %in% c("CLTCB","NNTCB","PMTCB"))
use_btu<-filter(use_btu,State %in% c("AZ","CA","NM","TX"))
Now each row is a time series, which is not easy for us to process, so we turn it into long data.
use_btu<-melt(use_btu,id.vars = c("State","MSN"),variable.name = "Year",value.name = "Value")
use_btu$Year<-substr(as.character(use_btu$Year),2,5)
use_btu$Year<-as.integer(use_btu$Year)
use_btu<-spread(use_btu,key = MSN,value = Value)
names(use_btu)<-c("StateCode","Year","CLTCB","NNTCB","PMTCB")
Similarly, we process the production data to make it easy for us to manipulate.
pro_btu<-pro_btu[,-1]
pro_btu<-filter(pro_btu,MSN %in% c("CLPRB","NGMPB","PAPRB"))
pro_btu<-filter(pro_btu,StateCode %in% c("AZ","CA","NM","TX"))
pro_btu<-melt(pro_btu,id.vars = c("StateCode","MSN"),variable.name = "Year",value.name = "Value")
pro_btu<-spread(pro_btu,key = MSN,value = Value)
pro_btu$Year<-as.numeric(as.character(pro_btu$Year))
#pro_btu$Year<-as.integer(pro_btu$Year)
Then, let’s combine the three data to allow us to fill in the predicted values from 2010 to 2017 and calculate the error.
fore_data<-use_btu
fore_data<-left_join(fore_data,pro_btu)
## Joining, by = c("StateCode", "Year")
fore_data<-left_join(fore_data,initial_data)
## Joining, by = c("StateCode", "Year")
Because model fitting can fail or warn for some series, we wrap each fit in tryCatch. Its general form is:
result = tryCatch(
{expr},
warning = function(w) {warning-handler-code},
error = function(e) {error-handler-code}
)
tryCatch({a<-"c"
b<-"c"
b==a},
error=function(e){cat("hahaha",conditionMessage(e),"\n\n")},
finally={print("ccc")})
## [1] "ccc"
## [1] TRUE
tryCatch(
{ a<-"c"
b<-"c"
b==a},
warning = function(w) {cat("WARNING :",conditionMessage(w),"\n")},
error=function(e){cat("ERROR :",conditionMessage(e),"\n")}
)
## [1] TRUE
We create a new data frame to save the mean absolute percentage error (MAPE) of the predicted values.
result<-as.data.frame(array(,dim = c(32,8)))
names(result)<-c("StateCode","Year","Coal_use","Nat_gas_use","Petr_use","Coal_pro","Nat_gas_pro","Petr_pro")
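The MAPE recorded in this table can be sketched as a small helper (a hypothetical function for illustration; the loop below computes the same ratio inline):

```r
# Mean absolute percentage error, as used in this section:
# the mean of |forecast - actual| / actual. Actual values of 0
# produce NaN, which is why some entries are patched with 0 later.
mape <- function(actual, forecast) mean(abs(forecast - actual) / actual)

mape(c(100, 200, 400), c(110, 190, 400))  # (0.10 + 0.05 + 0)/3 = 0.05
```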
Next, we use the ARIMA method to predict the production and consumption values from 2010 to 2017. Because we have many time series to forecast, we use auto.arima (from the forecast package) to select each model automatically.
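As a minimal sketch of the per-series workflow used in the loop below (a simulated series stands in for the real data; this assumes the forecast package is installed):

```r
library(forecast)  # provides auto.arima() and forecast()

set.seed(1)
y <- ts(cumsum(rnorm(50, mean = 2)))  # stand-in for one 1960-2009 series
fit <- auto.arima(y)                  # automatic ARIMA order selection
fc  <- forecast(fit, h = 8)           # 8 steps ahead, i.e. 2010-2017
fc$mean                               # point forecasts
```

The loop below repeats this for every state and energy source, clamping negative forecasts to 0.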
for(i in 1:4){
for (j in 1:6) {
stop=tryCatch(
{long<-fore_data[(58*(i-1)+1):(58*(i-1)+50),(8+j)]
c<-auto.arima(ts(long))
f<-forecast(c,h=8)
tem<-f$mean
for (k in 1:8) {
if(tem[k]<0){
fore_data[(58*(i-1)+50+k),(8+j)]<-0
}else{
fore_data[(58*(i-1)+50+k),(8+j)]<-tem[k]
}
rate_arima<-abs(fore_data[(58*(i-1)+50+k),(8+j)]-fore_data[(58*(i-1)+50+k),(2+j)])/fore_data[(58*(i-1)+50+k),(2+j)]
result[(8*(i-1)+k),1:2]<-fore_data[(58*(i-1)+50+k),1:2]
result[(8*(i-1)+k),(j+2)]<-rate_arima
}
},
error=function(e){cat("ERROR :",conditionMessage(e),"\n")}
)
}
}
When we look at the errors in the result table, we find some NaNs. Why? Going back to the original table, we see that the real values are 0 and they appear in the denominator. Since our predicted values at these points are also 0, the errors are 0, so we manually replace the NaNs with 0.
result[9:16,6]<-0
Now we calculate the total MAPE between all the predicted values and the real values.
print(sum(result[,3:8]/192))
## [1] 0.4818238
The MAPE is 0.48, which is not small. What happened?
We plot the true and predicted production and consumption of the three major energy sources; six plots follow. In each plot, the line shows the true data downloaded from the official website. The points from 1960 to 2009 are the values we had originally, and the points from 2010 to 2017 are the predicted values. In some plots the 1960-2009 points differ slightly from the line, but we ignore that. The predictions for some states and energy sources are quite accurate, while others are not.
ggplot(fore_data) +
geom_line(mapping = aes(x=Year,y=CLTCB,color=StateCode)) +
geom_point(mapping = aes(x=Year,y=Coal_use,color=StateCode))
ggplot(fore_data) +
geom_line(mapping = aes(x=Year,y=NNTCB,color=StateCode)) +
geom_point(mapping = aes(x=Year,y=Nat_gas_use,color=StateCode))
ggplot(fore_data) +
geom_line(mapping = aes(x=Year,y=PMTCB,color=StateCode)) +
geom_point(mapping = aes(x=Year,y=Petr_use,color=StateCode))
ggplot(fore_data) +
geom_line(mapping = aes(x=Year,y=CLPRB,color=StateCode)) +
geom_point(mapping = aes(x=Year,y=Coal_pro,color=StateCode))
ggplot(fore_data) +
geom_line(mapping = aes(x=Year,y=NGMPB,color=StateCode)) +
geom_point(mapping = aes(x=Year,y=Nat_gas_pro,color=StateCode))
ggplot(fore_data) +
geom_line(mapping = aes(x=Year,y=PAPRB,color=StateCode)) +
geom_point(mapping = aes(x=Year,y=Petr_pro,color=StateCode))
Let’s look at the errors by state and by energy. The Total row gives the average error of each energy source across the four states.
result_state<-as.data.frame(array(,dim = c(5,7)))
names(result_state)<-c("StateCode","Coal_use","Nat_gas_use","Petr_use","Coal_pro","Nat_gas_pro","Petr_pro")
for (i in 1:6) {
for (j in 1:4) {
result_state[j,1]<-result[8*j,1]
result_state[j,(i+1)]<-sum(result[(8*(j-1)+1):(8*j),(i+2)])/8
}
result_state[5,(i+1)]<-sum(result_state[1:4,(i+1)])/4
}
result_state[5,1]<-"Total"
result_state_1<-result_state %>%
gather("Coal_use","Nat_gas_use","Petr_use","Coal_pro","Nat_gas_pro","Petr_pro",key = "Energy",value = "Value")
ggplot(result_state_1) +
geom_point(mapping = aes(y=Value,x=StateCode,color=Energy))
As we can see from the point plot, the prediction error of natural gas production in AZ state is very large, up to more than 6. We need to try other methods to reduce this error.
print(result_state$Nat_gas_pro)
## [1] 6.53571457 0.08199723 0.08924579 0.08237305 1.69733266
Next we use the Prophet method to make predictions. Prophet is an open-source forecasting tool from Facebook with Python and R interfaces. It has smoothing parameters for seasonality that let developers match historical cycles, and a smoothing parameter for the trend that adjusts how closely it follows historical trends. For growth curves, developers can set a carrying capacity (cap) to inject prior information about how the forecast should grow or saturate. Finally, developers can specify irregular dates to model special days such as Thanksgiving and Black Friday.
At its core, Prophet is an additive regression model with four components: a piecewise linear or logistic growth trend, with changepoints detected automatically from the data; a yearly seasonal component modeled with Fourier series; a weekly seasonal component modeled with dummy variables; and a user-supplied table of important holidays.
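A minimal yearly Prophet run on a simulated series (assuming the prophet package is installed; with annual data, the sub-yearly seasonalities can simply be disabled):

```r
library(prophet)

set.seed(1)
history <- data.frame(
  ds = as.Date(paste0(1960:2009, "-01-01")),  # one observation per year
  y  = cumsum(rnorm(50, mean = 2))            # stand-in for a real series
)
# Trend-only model: annual data has no daily/weekly/yearly sub-structure
m <- prophet(history, yearly.seasonality = FALSE,
             weekly.seasonality = FALSE, daily.seasonality = FALSE)
future <- make_future_dataframe(m, periods = 8, freq = "year")
fcst   <- predict(m, future)
tail(fcst[, c("ds", "yhat")], 8)  # point forecasts for 2010-2017
```

The loop below follows the same pattern but keeps the seasonality flags used in the original code.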
result1<-as.data.frame(array(,dim = c(32,8)))
names(result1)<-c("StateCode","Year","Coal_use","Nat_gas_use","Petr_use","Coal_pro","Nat_gas_pro","Petr_pro")
fore_data1<-fore_data
for(i in 1:4){
for (j in 1:6) {
stop=tryCatch(
{long<-fore_data1[(58*(i-1)+1):(58*(i-1)+50),c(2,(8+j))]
names(long)<-c("ds","y")
long$ds <- paste(as.character(long$ds),"0101",sep="")
long$ds<-as.Date(long$ds, format= "%Y%m%d")
#c<-auto.arima(ts(long))
m<-prophet(long,daily.seasonality=TRUE,weekly.seasonality=TRUE)
#f<-forecast(c,h=8)
#tem<-f$mean
future <- make_future_dataframe(m,periods =8,freq = 'year')
forecast <- predict(m,future)
for (k in 1:8) {
if(forecast$yhat[50+k]<0){
fore_data1[(58*(i-1)+50+k),(8+j)]<-0
}else{
fore_data1[(58*(i-1)+50+k),(8+j)]<-forecast$yhat[50+k]
}
rate_arima<-abs(fore_data1[(58*(i-1)+50+k),(8+j)]-fore_data1[(58*(i-1)+50+k),(2+j)])/fore_data1[(58*(i-1)+50+k),(2+j)]
result1[(8*(i-1)+k),1:2]<-fore_data1[(58*(i-1)+50+k),1:2]
result1[(8*(i-1)+k),(j+2)]<-rate_arima
}
},
error=function(e){cat("ERROR :",conditionMessage(e),"\n")}
)
}
}
Similarly, we manually populate the NaNs with 0.
result1[9:16,6]<-0
We calculate the error by state and by energy. Except for the natural gas error for the AZ state, the Prophet errors are larger than those obtained from the ARIMA method, so we only take the natural gas error for AZ.
result1_state<-as.data.frame(array(,dim = c(5,7)))
names(result1_state)<-c("StateCode","Coal_use","Nat_gas_use","Petr_use","Coal_pro","Nat_gas_pro","Petr_pro")
for (i in 1:6) {
for (j in 1:4) {
result1_state[j,1]<-result1[8*j,1]
result1_state[j,(i+1)]<-sum(result1[(8*(j-1)+1):(8*j),(i+2)])/8
}
result1_state[5,(i+1)]<-sum(result1_state[1:4,(i+1)])/4
}
result1_state[5,1]<-"Total"
The natural gas error for the AZ state is still above 1, but greatly reduced from 6.5.
print(result1_state$Nat_gas_pro)
## [1] 1.9555772 0.1296952 0.2616772 0.2003537 0.6368258
We replace the ARIMA predictions of natural gas production for the AZ state with the Prophet predictions, then recompute the average error over all predicted values. The average error is 0.29, which is pretty good. That concludes our forecast of production and consumption.
fore_data[51:58,7]<-fore_data1[51:58,7]
result[1:8,7]<-result1[1:8,7]
print(sum(result[,3:8])/192)
## [1] 0.2909847
Now let’s predict the CO2 emission data. We use CO2 data from 1960 to 2009 to predict emissions of the three major sources from 2010 to 2015, selecting the total emissions from coal, natural gas and petroleum.
co2<-select(co2data,state,year,Fuel.Totals.Coal, Fuel.Totals.Petroleum.Products, Fuel.Totals.Natural.Gas)
We create a new data frame to save the mean absolute percentage error (MAPE) of the predicted values.
result_co2<-as.data.frame(array(,dim = c(24,5)))
names(result_co2)<-c("StateCode","Year","Coal","Nat_gas","Petr")
Next, we use the ARIMA method to do the prediction.
for(i in 1:4){
for (j in 1:3) {
stop=tryCatch(
{long<-co2[(56*(i-1)+1):(56*(i-1)+50),(2+j)]
c<-auto.arima(ts(long))
f<-forecast(c,h=6)
tem<-f$mean
for (k in 1:6) {
if(tem[k]<0){
co2[(56*(i-1)+50+k),(5+j)]<-0
}else{
co2[(56*(i-1)+50+k),(5+j)]<-tem[k]
}
rate_arima<-abs(co2[(56*(i-1)+50+k),(5+j)]-co2[(56*(i-1)+50+k),(2+j)])/co2[(56*(i-1)+50+k),(2+j)]
result_co2[(6*(i-1)+k),1]<-as.character(co2[(56*(i-1)+50+k),1])
result_co2[(6*(i-1)+k),2]<-co2[(56*(i-1)+50+k),2]
result_co2[(6*(i-1)+k),(j+2)]<-rate_arima
}
#result1[i,3]<-rate_arima1
#result1[i,4]<-rate_arima2
},
error=function(e){cat("ERROR :",conditionMessage(e),"\n")}
)
}
}
Now we calculate the total MAPE between all the predicted values and the real values. The error rate is 0.13, which is pretty small.
print(sum(result_co2[,3:5]/72))
## [1] 0.1307371
We check the error rate by state and by energy. It can be found that from 2013 to 2016, the error rate of coal energy in CA state is relatively large, while other errors are relatively small. Since the total error is small, we will not continue to explore. So that’s the end of our forecast for CO2 emission.
ggplot(result_co2) +
geom_line(mapping = aes(x=Year,y=Coal,color=StateCode))
ggplot(result_co2) +
geom_line(mapping = aes(x=Year,y=Nat_gas,color=StateCode))
ggplot(result_co2) +
geom_line(mapping = aes(x=Year,y=Petr,color=StateCode))
This project analyzes energy production, energy consumption and carbon dioxide emissions in four US states (California, Arizona, New Mexico and Texas). In addition, we established an evaluation system, which we call the Energy Performance Index, to evaluate each state’s energy performance.
Firstly, we found that some values were missing from the data needed for the study. We used a grey prediction model to fill in the missing values; for those that could not be filled this way, we used 0 or the mean value.
We then performed a descriptive analysis of energy production, energy consumption and CO2 emissions from 1960 to 2015. In addition, to better understand the energy structure, we divided energy into clean and non-clean types and analyzed the use of each type in every state. We found that the share of clean energy kept increasing over this period and that Texas produced and consumed the most energy. In terms of CO2, we analyzed emissions by state, by fuel type and by social sector. Overall CO2 emissions rose from 1960 to 2015, apart from a significant drop in 2006, and Texas had the highest emissions.
After analyzing the basic situation of each state, we built the evaluation system. We divided all the variables into three main indexes: Environmental Sustainability, Economic Growth and Development, and Energy Access and Security. Each index is calculated from four sub-indices, which are in turn calculated from the basic indicators (the formulas are listed in section 4). With all the indexes prepared, we used AHP to determine the weight of each indicator and then computed the final index for each state. Judging from the index values, different states perform best in different aspects, so the best performing state depends on the user’s concerns.
Finally, with all the production and consumption data prepared, we chose three major energy sources for prediction: coal, oil and natural gas. We used the ARIMA method, supplemented by Prophet for the Arizona natural gas series, and the overall average error is 0.29, which suggests the prediction model is a good one. For CO2 emission prediction we used the ARIMA method, and the MAPE is only 0.13.
In this project, we not only performed a descriptive analysis of the energy production, consumption, and carbon dioxide emissions of the four US states, but also established an evaluation system to assess the energy situation of each state. The entire project process is complete and practically instructive.